Abstract

We study the weight space structure of the parity machine with binary weights by
deriving the distribution of volumes associated to the internal representations of the
learning examples. The learning behaviour and the symmetry breaking transition are
analyzed and the results are found to be in very good agreement with extensive numerical
simulations.
I. INTRODUCTION



The understanding of the learning process of neural networks is of great importance from
both the theoretical and the applications points of view [1]. While the properties of the
simplest neural network, the perceptron, are now well explained, the picture we have of the
learning phase of the far more relevant case of multilayer neural networks remains
unsatisfactory. Due to the internal degrees of freedom present in multilayer networks (the
state variables of the hidden units), the structure of the weight space inherited from the
learning procedure is highly non-trivial.

Gardner's framework of statistical mechanics has proven useful in understanding the
learning process by providing bounds on the optimal performance of neural networks. In
particular, it has made it possible to derive the storage capacity and the generalization
abilities of neural networks inferring a rule by example. However, the drawback of such an
approach is that it gives no microscopic information concerning the internal structure of
the coupling space, in particular about internal representations.

Recently, an extension of Gardner's approach has been proposed which leads to a deeper
insight into the structure of the weight space by looking at the components of the latter
corresponding to different states of the internal layers of the network. Such an approach
has been successful in explaining some known features of multilayer neural networks and
has made it possible to find new results concerning their learning-generalization
performance as well as to make a rigorous connection with information theory.

In this paper we focus on multilayer neural networks with binary weights. This allows us to
compare the analytical study with extensive numerical simulations and thus to provide a
concrete check of the reliability of the theory. Indeed, both the structure of internal
representations and the (symmetry-breaking) learning phase transition predicted by our
theory turn out to be in remarkable agreement with the numerical findings.

The paper is organized as follows. In Section II we present our method from a general
point of view, and in Section III we apply it to the parity machine with binary weights.
Section IV is devoted to numerical simulations. Our results are summed up in the conclusion.



II. DISTRIBUTION OF THE INTERNAL REPRESENTATION VOLUMES 

As discussed in earlier work, the method we adopt consists in a rather natural
generalization of the well known Gardner approach, based on the study of the fractional
weight space volume not ruled out by the optimal, yet unknown, learning process. We analyze
the detailed decomposition of such a volume into elementary volumes, each one associated to
a possible internal representation of the learned examples. The dynamical variables
entering the statistical mechanics formalism are the (binary valued) interaction couplings
and the spin-like states of the hidden units.

In what follows, we focus on non-overlapping multilayer networks composed of $K$
perceptrons with weights $J_{\ell i}$ and connected to $K$ sets of independent inputs
$\xi_{\ell i}$ ($\ell = 1,\dots,K$, $i = 1,\dots,N/K$).

The learning process may be thought of as a two-step geometrical process taking place in
the weight space from the input to the hidden layer. First, the $N/K$-dimensional subspace
belonging to the $\ell$-th perceptron (or hidden unit) is divided into a number of volumes
($\le 2^P$), each of which is labeled by a $P$-component vector

$$\tau_\ell^\mu = \mathrm{sign}\left(\mathbf{J}_\ell \cdot \boldsymbol{\xi}_\ell^\mu\right) , \quad \ell = 1,\dots,K , \quad \mu = 1,\dots,P . \qquad (1)$$

$\tau_\ell^\mu$ is the spin variable representing the state of the $\ell$-th hidden unit when
pattern number $\mu$ is presented at the input. Next, the solution space is defined as the
direct product of the volumes belonging to all hidden nodes and satisfying the condition
imposed by the decoder function

$$f(\{\tau_\ell^\mu\}) = \sigma^\mu , \qquad (2)$$

where $\sigma^\mu$ is the output classifying the input pattern. The overall space of
solutions is thus composed of a set of internal volumes $V_T$, each identified by the
$K \times P$ matrix $T = \{\tau_\ell^\mu\}$ called the internal representation of the
learning examples. The computation of the whole distribution of volumes $V_T$, both their
typical size and their typical number, yields a deeper understanding of the storage problem
through the comparison of the number $\exp(N \mathcal{N}_D)$ of volumes giving the dominant
contribution to Gardner's volume with the upper bound given by the total number
$\exp(N \mathcal{N}_R)$ of non-empty volumes (i.e. the total number of implementable
internal representations). Moreover, the physics of the learning transition (the freezing
phenomenon and the replica symmetry breaking transition) acquires a detailed geometrical
interpretation.

Here we consider the case of parity machines, which are characterized by a decoder
function defined as the product of the internal representation,
$\sigma^\mu = f(\{\tau_\ell^\mu\}) = \prod_\ell \tau_\ell^\mu$.

As mentioned, given a set of $P = \alpha N$ binary input-output random relations, the
learning process can be described as a geometrical selection process aimed at finding a
suitable set of internal representations $T = \{\tau_\ell^\mu\}$ characterized by a non-zero
elementary volume $V_T$ defined by

$$V_T = \sum_{\{J_{\ell i} = \pm 1\}} \prod_{\mu,\ell} \theta\Big(\tau_\ell^\mu \sum_i J_{\ell i}\, \xi_{\ell i}^\mu\Big) , \qquad (3)$$

where $\theta(\dots)$ is the Heaviside function. The overall volume of the weight space
available for learning (the Gardner volume $V_G$) can be written as

$$V_G = \sum_T V_T . \qquad (4)$$

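For the learning problem, the distribution of volumes can be derived through the free
energy

$$g(r) = -\frac{1}{N r}\, \overline{\ln\Big(\sum_T V_T^{\,r}\Big)} \qquad (5)$$

(the bar denoting the average over the patterns) by calculating the entropy
$\mathcal{N}[w(r)]$ of the volumes $V_T$ whose inverse sizes are equal to
$w(r) = -\frac{1}{N}\ln V_T$, given by the Legendre relations

$$w(r) = \frac{\partial (r g(r))}{\partial r} , \qquad \mathcal{N}[w(r)] = -\frac{\partial g(r)}{\partial (1/r)} . \qquad (6)$$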



When $N \to \infty$, $\frac{1}{N}\ln(V_G) = -g(r = 1)$ is dominated by volumes of size
$w(r = 1)$ whose corresponding entropy (i.e. the logarithm of their number divided by $N$)
is $\mathcal{N}_D = \mathcal{N}[w(r = 1)]$ and, at the same time, the most numerous ones are
those of smaller size $w(r = 0)$ (since in the limit $r \to 0$ all the $T$ are counted
irrespective of their relative volumes) whose entropy
$\mathcal{N}_R = \mathcal{N}[w(r = 0)]$ is the (normalized) logarithm of the total number of
implementable internal representations. Both $\mathcal{N}_D$ and $\mathcal{N}_R$ make it
possible to build a rigorous link between statistical mechanics and information theory. The
former ($\mathcal{N}_D$) coincides with the quantity of information
$\mathcal{I} = -\sum_T p_T \log p_T$ contained in the internal representation distribution
$T$ and concerning the weights, whereas the latter ($\mathcal{N}_R$) is the information
capacity of the system, i.e. the maximal quantity of information one can extract from the
knowledge of the internal representations.
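
The identification of the entropy at $r = 1$ with an information content can be checked in a few lines. The following is a sketch for a single sample of patterns (no average), assuming the free energy $g(r) = -\frac{1}{Nr}\ln\sum_T V_T^r$, the Legendre relation $w(r) = \partial(r g)/\partial r$ and the relation $\mathcal{N} = -r g + r w$; the normalized weights $p_T = V_T / V_G$ are a notation introduced here for convenience.

```latex
\begin{align}
w(r=1) &= \left.\frac{\partial\,(r g(r))}{\partial r}\right|_{r=1}
        = -\frac{1}{N}\sum_T \frac{V_T}{V_G}\,\ln V_T
        = -\frac{1}{N}\sum_T p_T \ln V_T , \\
\mathcal{N}[w(r=1)] &= -g(1) + w(1)
        = \frac{1}{N}\ln V_G - \frac{1}{N}\sum_T p_T \ln V_T
        = -\frac{1}{N}\sum_T p_T \ln p_T .
\end{align}
```

In words, the entropy of the dominant volumes is the Shannon entropy (per weight) of the distribution obtained by picking an internal representation with probability proportional to its volume.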

III. ANALYTICAL CALCULATION FOR THE BINARY PARITY MACHINE 

In the following, we shall apply the above method to derive the weight space structure of
the non-overlapping parity machine with binary couplings. The analysis of binary models is
indeed more complicated than that of their continuous counterparts due to Replica Symmetry
Breaking (RSB) effects. However, in the binary case extensive numerical simulations on
finite systems become feasible, allowing for a very detailed check of the theory.

In the computation of $g(r)$, $\mathcal{N}[w(r)]$ and $w(r)$ one assumes that, due to their
extensive character, the self-averaging property holds. We proceed in the computation of
$g(r)$ following the scheme presented in [2,11] and discussed above. The basic technical
difference with the standard Gardner approach resides in the double analytic continuation
inherited from the presence of two sets of replica indices in the weight vectors: the first
comes from the integer power $r$ of the internal volumes appearing in the partition
function, the second from the replica trick.

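The replicated partition function reads

$$\overline{\Big(\sum_T V_T^{\,r}\Big)^{n}} = \overline{\sum_{\{\tau_\ell^{\mu,a}\}} \sum_{\{J_{\ell i}^{a\nu}\}} \prod_{\mu,a} \theta\Big(\prod_\ell \tau_\ell^{\mu,a}\Big) \prod_{\mu,\ell,a,\nu} \theta\Big(\tau_\ell^{\mu,a} \sum_i J_{\ell i}^{a\nu}\, \xi_{\ell i}^{\mu}\Big)} \qquad (7)$$

with $\nu = 1,\dots,r$ and $a = 1,\dots,n$, which in turn implies the introduction of four
sets of order parameters. In the above formula, with no loss of generality, we have set
$\sigma^\mu = 1$, $\forall \mu$.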

At variance with Gardner's approach, the partition function (7) requires a double
configuration trace, over the internal state variables and the binary couplings. We find



g(r) = -Extr g<<5/ - T{QiQi) 



(8) 



where T reads 

HQtQi) = 



E Tr (QiQi) + ^ E in 



JiQiJi 



+ 



+ a In 



rr w e IK /I! 



n 



(9) 



with a^, x^, (n x r)-dimensional vectors. The elements of the (n x r) x (n x r) matrices 
e are the overlaps 



ft 



a, vi, ,9,^2 



ETQVl t/3v 2 



iV 



(10) 



between two coupling vectors belonging to the same hidden unit $\ell$ and their conjugate
variables. The simplest non-trivial Ansatz (which can be physically understood within the
cavity approach [10]) on the structure of the above matrices, the Replica Symmetric (RS)
Ansatz of our approach, must distinguish elements with $a = b$ from those with $a \neq b$,
whereas it ignores differences between replica blocks and between hidden units. The
matrices $Q_\ell$, $\hat Q_\ell$ become independent of $\ell$, with elements

$$Q_\ell^{a=b,\,\nu_1 \neq \nu_2} = q^* , \quad \hat Q_\ell^{a=b,\,\nu_1 \neq \nu_2} = \hat q^* , \quad Q_\ell^{a \neq b,\,\nu_1,\nu_2} = q_0 , \quad \hat Q_\ell^{a \neq b,\,\nu_1,\nu_2} = \hat q_0 . \qquad (11)$$

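We then find

$$g(r, q_0, \hat q_0, q^*, \hat q^*) = -\frac{1}{2} r q_0 \hat q_0 + \frac{1}{2}(r-1) q^* \hat q^* + \frac{1}{2}\hat q^* - \frac{1}{r}\int Dx \,\ln \int Dy \left(2\cosh\big(\sqrt{\hat q_0}\,x + \sqrt{\hat q^* - \hat q_0}\,y\big)\right)^r$$
$$\qquad - \frac{\alpha}{r} \int \prod_{\ell=1}^{K} Dy_\ell \,\ln\, \mathrm{Tr}_{\{\tau_\ell\}} \prod_{\ell=1}^{K} \int Dx_\ell \, H^r\!\left(\frac{\sqrt{q^* - q_0}\,x_\ell + \sqrt{q_0}\,\tau_\ell\, y_\ell}{\sqrt{1 - q^*}}\right) \qquad (12)$$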



where the trace $\mathrm{Tr}_{\{\tau_\ell\}}$ is understood to include the factor
$\theta(\prod_\ell \tau_\ell)$, $Dx = \exp(-x^2/2)\,dx/\sqrt{2\pi}$ and
$H(y) = \int_y^\infty Dx$. One may notice that the above expression evaluated for $r = 1$
reduces to the RS Gardner-like result for the parity machine, independent of the parameters
$q^*$ and $\hat q^*$:

$$g(r = 1, q_0, \hat q_0, q^*, \hat q^*) = G_{RS}(q_0, \hat q_0) = -\frac{1}{N}\ln V_G , \qquad (13)$$

where $V_G$ is the Gardner volume. The geometrical organization of the domains is thus
hidden in the Gardner volume and shows up only when $r \neq 1$ or if derivatives with
respect to $r$ are considered, leading to an explicit dependence on the order parameters
$q^*$, $\hat q^*$:

$$g(r = 1 + \epsilon, q_0, \hat q_0, q^*, \hat q^*) = G_{RS}(q_0, \hat q_0) + \epsilon\, \frac{\partial g}{\partial r}(q_0, \hat q_0, q^*, \hat q^*)\Big|_{r=1} . \qquad (14)$$

In particular, the functions $\mathcal{N}[w(r = 1)]$ and $w(r = 1)$, being derivatives of
$g(r)$, will depend on $q^*$ and $\hat q^*$.

The RS saddle point equations read: 



9g(r) 
dqo 



: 



go 



2 ) W = : 



Ax 



/ Ay (cosh r (v/go"x + y/q* -q y)) taiah(^x + y/q* - q y 



Ax 



J Ay cosh r (v/go"x + y/q* - q y) 



J Ay cosh r ( v / g "x + y/q* - q y) tanh^y/gpx + y/q* - q y) 
f Ay cosh r ( v /g ~x + \/g* - g y) 



3) 



9g(r) 
dqo 



go 



: 



aK 



2?r(l -q*) 



Tr {Te} n nl 2 .f^eH r (A e ) J Ax 1 H r -\A 1 ) e - A ( 



Tr {Te} UeJAx £ H r (A e 



in which 



A, 



y/g* -q x e + y/%r e y E 



4) = : 



dq* 

g 



"2tt(1 - q* 



(15) 



(16) 



(17) 



(18) 



Y Tr {^i j if r (^) 7 



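The case of the parity machine is relatively simple in that a consistent solution for the
first two equations leads to $q_0 = 0$ and $\hat q_0 = 0$ (as happens in the computation of
$V_G$), which means that the domains remain uncorrelated during the learning process. The
latter two equations simplify to

$$\hat q^* = \frac{\alpha K}{2\pi (1 - q^*)}\, \frac{\int Dx \, H^{r-2}\!\Big(\sqrt{\tfrac{q^*}{1-q^*}}\,x\Big)\, \exp\!\Big(-\tfrac{q^*}{1-q^*}\,x^2\Big)}{\int Dx \, H^{r}\!\Big(\sqrt{\tfrac{q^*}{1-q^*}}\,x\Big)} , \qquad (20)$$

$$q^* = \frac{\int Dy \, \cosh^r(\sqrt{\hat q^*}\,y)\, \tanh^2(\sqrt{\hat q^*}\,y)}{\int Dy \, \cosh^r(\sqrt{\hat q^*}\,y)} , \qquad (21)$$

with a free energy given by

$$g(r, q^*, \hat q^*) = -\frac{1}{2}(1 - r)\, q^* \hat q^* + \frac{\hat q^*}{2} - \frac{1}{r}\ln \int Dy \, 2^r \cosh^r(\sqrt{\hat q^*}\,y) - \frac{\alpha}{r}(K - 1)\ln 2 - \frac{\alpha}{r} K \ln \int Dx \, H^r\!\Big(\sqrt{\tfrac{q^*}{1-q^*}}\,x\Big) . \qquad (22)$$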

For the parameters there are two kinds of solution. The first one, $q^* \to 1$,
$\hat q^* \to \infty$, leads to $w(r) = 0$ and $\mathcal{N}[w(r)] = (1 - \alpha)\ln 2$
independently of $r$. The second kind must be computed numerically from (20) and (21).
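
As an illustration of how the second kind of solution can be handled numerically, the sketch below evaluates the simplified RS free energy $g(r, q^*, \hat q^*)$ (with $q_0 = \hat q_0 = 0$) and the right-hand side of the saddle point equation for $q^*$ by Gauss-Hermite quadrature. It is a minimal sketch, not the procedure used by the authors: the function names `g_rs` and `q_from_qhat`, the quadrature order and the use of NumPy are all assumptions made here.

```python
import math
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Nodes/weights for the weight function exp(-x^2/2); dividing the weights by
# sqrt(2*pi) turns the rule into an average over the Gaussian measure Dx.
_X, _W = hermegauss(80)
_W = _W / math.sqrt(2.0 * math.pi)

_erfc = np.vectorize(math.erfc)


def H(u):
    """H(u) = int_u^inf Dx, written through the complementary error function."""
    return 0.5 * _erfc(np.asarray(u, dtype=float) / math.sqrt(2.0))


def gauss_avg(values):
    """Quadrature estimate of int Dx f(x) for f sampled on the nodes _X."""
    return float(np.dot(_W, values))


def g_rs(r, q, qhat, alpha, K=3):
    """RS free energy g(r, q*, qhat*) with q0 = qhat0 = 0 (hypothetical helper)."""
    a = math.sqrt(q / (1.0 - q))
    entropic = math.log(gauss_avg((2.0 * np.cosh(math.sqrt(qhat) * _X)) ** r))
    energetic = math.log(gauss_avg(H(a * _X) ** r))
    return (-0.5 * (1.0 - r) * q * qhat + 0.5 * qhat
            - entropic / r
            - alpha * (K - 1) * math.log(2.0) / r
            - alpha * K * energetic / r)


def q_from_qhat(r, qhat):
    """Right-hand side of the saddle point equation giving q* from qhat*."""
    c = np.cosh(math.sqrt(qhat) * _X) ** r
    t2 = np.tanh(math.sqrt(qhat) * _X) ** 2
    return gauss_avg(c * t2) / gauss_avg(c)
```

A useful sanity check is that at $r = 1$ the value returned by `g_rs` does not depend on $q^*$ or $\hat q^*$ and equals $-(1 - \alpha)\ln 2$, in line with the statement that the $r = 1$ free energy reduces to the Gardner-like result.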

In the replica theory, the choice of the right saddle point solution, i.e. the maximization
or the minimization of the free energy, is not completely straightforward due to the
unusual $n \to 0$ analytic continuation. Here we must deal with a double analytic
continuation and the overall criterion that must be followed is given by

$$\begin{array}{lll} r < 0 , & q_0 \to \mathrm{MAX} , & q^* \to \mathrm{MIN} \\ 0 < r < 1 , & q_0 \to \mathrm{MAX} , & q^* \to \mathrm{MAX} \\ r > 1 , & q_0 \to \mathrm{MAX} , & q^* \to \mathrm{MIN} , \end{array} \qquad (23)$$

where MAX or MIN indicates whether one must choose the solution which maximizes or
minimizes the free energy $g(r)$, respectively.

Like the zero entropy criterion for the binary perceptron, the behaviour of
$\mathcal{N}[w(r)]$ and $w(r)$ (the cases $r = 0$ and $r = 1$ being of particular interest)
tells us when the RS Ansatz breaks down. Notice that in the binary case the volume size
$w(r)$ also assumes the role of an entropy, in that it coincides with (minus) the logarithm
of the normalized number of binary weight vectors belonging to a domain.

The Legendre transforms (6) of $g(r)$ lead to the formulas



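$$w(r) = r q^* \hat q^* + \frac{1}{2}\hat q^* (1 - q^*) - \frac{\int Dy \, \cosh^r(\sqrt{\hat q^*}\,y)\, \ln\big(2\cosh(\sqrt{\hat q^*}\,y)\big)}{\int Dy \, \cosh^r(\sqrt{\hat q^*}\,y)} - \alpha K\, \frac{\int Dx \, H^r\!\Big(\sqrt{\tfrac{q^*}{1-q^*}}\,x\Big) \ln H\!\Big(\sqrt{\tfrac{q^*}{1-q^*}}\,x\Big)}{\int Dx \, H^r\!\Big(\sqrt{\tfrac{q^*}{1-q^*}}\,x\Big)} \qquad (24)$$

and

$$\mathcal{N}[w(r)] = \frac{r^2}{2} q^* \hat q^* + \ln \int Dy \, 2^r \cosh^r(\sqrt{\hat q^*}\,y) - r\, \frac{\int Dy \, \cosh^r(\sqrt{\hat q^*}\,y)\, \ln\big(2\cosh(\sqrt{\hat q^*}\,y)\big)}{\int Dy \, \cosh^r(\sqrt{\hat q^*}\,y)} + \alpha (K - 1) \ln 2$$
$$\qquad + \alpha K \ln \int Dx \, H^r\!\Big(\sqrt{\tfrac{q^*}{1-q^*}}\,x\Big) - \alpha K r\, \frac{\int Dx \, H^r\!\Big(\sqrt{\tfrac{q^*}{1-q^*}}\,x\Big) \ln H\!\Big(\sqrt{\tfrac{q^*}{1-q^*}}\,x\Big)}{\int Dx \, H^r\!\Big(\sqrt{\tfrac{q^*}{1-q^*}}\,x\Big)} . \qquad (25)$$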



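The number $\mathcal{N}_D$ of domains composing $V_G$ is given by
$\mathcal{N}[w(r = 1)] = -g(1) + w(1)$:

$$\mathcal{N}[w(r = 1)] = \frac{\hat q^*}{2}(q^* + 1) - \frac{\int Dy \, \cosh(\sqrt{\hat q^*}\,y)\, \ln\big(2\cosh(\sqrt{\hat q^*}\,y)\big)}{\int Dy \, \cosh(\sqrt{\hat q^*}\,y)} + (1 - \alpha)\ln 2 - 2\alpha K \int Dx \, H\!\Big(\sqrt{\tfrac{q^*}{1-q^*}}\,x\Big) \ln H\!\Big(\sqrt{\tfrac{q^*}{1-q^*}}\,x\Big) . \qquad (26)$$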

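The number $\mathcal{N}_R$ of the most numerous domains, i.e. the total number of
implementable internal representations, is given by the limit $r = 0$. We find

$$w(r = 0) = \frac{1}{2}\hat q^* (1 - q^*) - \int Dy \, \ln\big(2\cosh(\sqrt{\hat q^*}\,y)\big) - \alpha K \int Dx \, \ln H\!\Big(\sqrt{\tfrac{q^*}{1-q^*}}\,x\Big) \qquad (27)$$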



and 



TV" [w(r = 0)] = a(Jfe - 1) In 2 + aKln 



— h lim 

2 r^OJo 



00 (fx 2 (l- q * +rq *) 



2(1-9*) 



(28) 



The second term on the r.h.s. of the above expression is different from zero only if
$\lim_{r \to 0} \frac{r}{1 - q^*} = \mathrm{const.}$, as happens in the continuous case. In
both the continuous and binary cases, beyond a certain value of $\alpha$, the number of
internal representations which can be realized becomes smaller than $2^{(K-1)P}$ as the
domains progressively disappear. However, in the binary case the parameter $q^*$ does not
vanish continuously and a first order RSB transition to a theory described by two order
parameters $q_1^*$, $q_0^*$ is required.

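At the point where $w(r)$ vanishes the RS Ansatz must be changed. Following the standard
one-step RSB scheme, the one-step RSB expression is obtained by breaking the symmetry
within each elementary volume and introducing the corresponding order parameters
$(q_0^*, \hat q_0^*, q_1^*, \hat q_1^*, m)$ in place of $(q^*, \hat q^*)$. The free energy
reads

$$g_{RSB}(q_0 = 0, \hat q_0 = 0, q_0^*, \hat q_0^*, q_1^*, \hat q_1^*, r, m) = \frac{1}{2}\Big(q_1^* \hat q_1^* (m - 1) + \hat q_1^* + q_0^* \hat q_0^* (r - m)\Big) - \frac{1}{r}\ln \int Dy \left[\int Dz \, 2^m \cosh^m\!\big(\sqrt{\hat q_0^*}\,y + \sqrt{\hat q_1^* - \hat q_0^*}\,z\big)\right]^{r/m}$$
$$\qquad - \frac{\alpha}{r}(K - 1)\ln 2 - \frac{\alpha K}{r}\ln \int Dy \left[\int Dz \, H^m\!\left(\frac{\sqrt{q_0^*}\,y + \sqrt{q_1^* - q_0^*}\,z}{\sqrt{1 - q_1^*}}\right)\right]^{r/m} . \qquad (29)$$

As for the binary perceptron, setting $q_1^* = 1$ leads to $\hat q_1^* = \infty$ and

$$g_{RSB}(q_0 = 0, \hat q_0 = 0, q_0^*, \hat q_0^*, q_1^* = 1, \hat q_1^* = \infty, m, r) = \frac{1}{2}\Big(q_0^* \hat q_0^* (r - m) + \hat q_0^* m\Big) - \frac{1}{r}\ln \int Dy \, 2^{r/m} \cosh^{r/m}\!\big(\sqrt{\hat q_0^*}\,y\,m\big)$$
$$\qquad - \frac{\alpha}{r}(K - 1)\ln 2 - \frac{\alpha K}{r}\ln \int Dy \, H^{r/m}\!\left(\frac{\sqrt{q_0^*}\,y}{\sqrt{1 - q_0^*}}\right) . \qquad (30)$$

Therefore, we may also write

$$g_{RSB}(q_0^*, \hat q_0^*, q_1^* = 1, \hat q_1^* = \infty, m, r) = \frac{1}{m}\, g_{RS}(\hat q^* = \hat q_0^* m^2, q^* = q_0^*, r' = r/m) . \qquad (31)$$

The saddle point equation with respect to $m$ reads

$$\frac{\partial g_{RSB}}{\partial m} = -\frac{1}{m^2}\left(\frac{\partial\, r' g_{RS}}{\partial r'}\right) = 0 . \qquad (32)$$

Such an equation is nothing but the condition

$$w_{RS}(\hat q^* = \hat q_0^* m^2, q^* = q_0^*, r' = r/m) = 0 , \qquad (33)$$

that, in order to be satisfied, requires

$$\hat q_0^* = \frac{\hat q^*(r_c)}{m^2} , \qquad q_0^* = q^*(r_c) , \qquad m = \frac{r}{r_c} , \qquad (34)$$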
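where the parameter values $q^*$, $\hat q^*$ and $r_c$ are computed at the $w = 0$
transition point. From the relations

$$\frac{\partial\, r g_{RSB}}{\partial r} = \frac{1}{m}\frac{\partial\, r' g_{RS}}{\partial r'} , \qquad \frac{\partial g_{RSB}}{\partial r} = \frac{1}{m^2}\frac{\partial g_{RS}}{\partial r'} , \qquad (35)$$

it follows

$$w_{RSB}(r) = \frac{1}{m}\, w_{RS}(r_c) = 0 , \qquad \mathcal{N}_{RSB}[w_{RSB}(r)] = \mathcal{N}_{RS}[w_{RS}(r_c)] . \qquad (36)$$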

In Fig. 1 we show the behaviour of $r g(r)$ versus $r$ for $\alpha = 0.33$. The part of the
curve with positive slope cannot exist and hence beyond the $r_c$ value the function
remains constant and equal to $-\mathcal{N}[w_{RS}(r_c)]$.

Just like in the binary perceptron or in the Random Energy Model (for which the one-step
RSB solution is exact), below $r_c$ and for fixed $\alpha$, the system is completely
frozen. The function $r g(r)$ behaves like the free energy of the above mentioned systems,
though in such cases the freezing takes place with respect to the temperature and beyond
the critical temperature the free energy is equal to the constant value of the internal
energy. The detailed phase diagram in the $(\alpha, r)$ plane is reported in Fig. 2.

The behaviour of $\mathcal{N}[w(r)]$ versus $w(r)$ for $K = 3$ and four different values of
$\alpha$ is shown in Fig. 3.

One may observe four different phases:

1. For $\alpha < \alpha_1 = 0.17$, the curve does not touch the $w = 0$ abscissa and the
domains have volumes between the two values $w_1$, $w_2$ for which the ordinate vanishes.
For $r < r(\mathcal{N}[w_2] = 0)$ or $r > r(\mathcal{N}[w_1] = 0)$ the RS solution leads to
a number of domains less than one and must be rejected. The freezing process takes place at
the level of domains, in that there are no domains with $w$ values greater than $w_2$ or
lower than $w_1$. The RSB Ansatz substitutes the $q_0$ order parameter with $q_1$, $q_0$.

2. For $\alpha > 0.17$, the curve starts at $w = 0$ with slope $r_c(\alpha)$; hence
$\mathcal{N}[w(r)] = \mathcal{N}[w(r_c(\alpha))]$, $\forall r < r_c(\alpha)$.

3. At $\alpha = 0.277$ we have $r_c(\alpha) = 0$. The value $\alpha = 0.277$, where the
zero temperature entropy vanishes, is simply the critical capacity of a binary perceptron
with $N/3$ input units (the size of the most numerous domains corresponds to the solution
volume of a subperceptron). Beyond this $\alpha$ value, the curve will be enclosed in the
region of positive slope ($r > 0$) and the number of internal representations
$\mathcal{N}_R$ is no longer $2\alpha \ln 2$ (i.e. the maximal one) but is given by the
value of $\mathcal{N}[w(r)]$ at the starting point of the curve:

$$\mathcal{N}_R = \mathcal{N}[w(r_c(\alpha))] . \qquad (37)$$

4. At $\alpha = 0.41$ the starting slope is $r_c(0.41) = 1$ and
$\mathcal{N}[w(r) = 0] = (1 - \alpha)\ln 2$ (consistent with the condition
$g(1) = -(1 - \alpha)\ln 2$).

5. For $\alpha > 0.41$, the point $\mathcal{N}[w(r)] = (1 - \alpha)\ln 2$ is off the curve
and $r_s(\alpha)$ is the point at which the two solutions of the saddle point equations
lead to the same free energy value, i.e. such that

$$g_{RS}(r_s(\alpha)) = -\frac{(1 - \alpha)\ln 2}{r_s(\alpha)} . \qquad (38)$$

The starting point of the curve ($r_s(\alpha)$) grows with $\alpha$. For
$r < r_s(\alpha)$, the correct saddle point solution is the one giving
$\mathcal{N}[w(r)] = (1 - \alpha)\ln 2$ independently of $r$, i.e. the isolated point
marked in Fig. 3. The switch between the two solutions can be understood by noticing that
it corresponds to the only possible way of obtaining $g(1) = -(1 - \alpha)\ln 2$ for
$\alpha < 0.41$. Moreover, its physical meaning is that for $r < r_s(\alpha)$ it is not
necessary to distinguish among different domains, in that $V_G$ is dominated by the domains
of zero entropy independently of the freezing process.

6. For $\alpha = 0.56$ only one point remains.

7. At $\alpha = 1$ the point also disappears.

In the following section we will compare the behaviour of $\mathcal{N}_R$ and
$\mathcal{N}_D$ computed for $K = 3$ with the results of numerical simulations on finite
systems.
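Very schematically we have

$$\mathcal{N}_R = \begin{cases} 2\alpha \ln 2 & \alpha < 0.277 \\ \mathcal{N}[w(r_c(\alpha))] & 0.277 < \alpha < 0.41 \\ (1 - \alpha)\ln 2 & \alpha > 0.41 \end{cases} \qquad (39)$$

and

$$\mathcal{N}_D = \begin{cases} \mathcal{N}[w(r = 1)] & \alpha < 0.41 \\ (1 - \alpha)\ln 2 & \alpha > 0.41 \end{cases} \qquad (40)$$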



The overall scenario arising from the analytical computation may be summarized briefly as
follows. We find a freezing transition at $\alpha_2 = 0.41$ within the domains. For values
of $\alpha > \alpha_2$ the domains, though still distributed over the whole space of
solutions ($q_0 = 0$), are composed of configurations with overlap $q^* = 1$. The point
$\mathcal{N}_R = 0$ is the symmetry breaking point, also corresponding to the critical
capacity of the model $\alpha_c = 1$.



IV. NUMERICAL SIMULATIONS 

We have checked the above scenario by performing two distinct sets of extensive numerical
simulations on the weight space structure of a parity machine with binary weights and three
hidden units.

In the first set of simulations we have measured both the size $w(r)$ and the number
$\mathcal{N}[w(r)]$ of domains as functions of the loading parameter $\alpha$. In
particular we have considered the cases $r = 1$ and $r = 0$, giving respectively the
measure of the number $\mathcal{N}_D$ of domains contributing to the total Gardner volume
$V_G$ and the overall number $\mathcal{N}_R$ of implementable internal representations. In
the second set of simulations we have reconstructed the plot of $r g(r)$ and
$\mathcal{N}[w(r)]$ as functions of $r$ for fixed $\alpha$.

The numerical method adopted is the exact enumeration of the configurations
$\{J_{\ell i}\}$ on finite systems. Very schematically the procedure is the following.

1. choose $P$ random patterns;

2. divide, for every subperceptron, the set of $2^n$ ($n = N/3$) configurations into
subsets labeled by the vectors $\vec\tau_\ell$
($\tau_\ell^\mu = \mathrm{sign}(\mathbf{J}_\ell \cdot \boldsymbol{\xi}_\ell^\mu)$),
$\ell = 1, 2, 3$;

3. try all the subset combinations between the three subperceptrons and identify the
domains of solutions as those which satisfy $\prod_\ell \tau_\ell^\mu = 1$, $\forall \mu$.

The above scheme yields a parallel enumeration and classification of the $2^n$ weight
configurations in the three subperceptrons. To avoid ambiguities in the signs of the hidden
fields the number of inputs connected to each hidden unit must be odd. The sizes of the
systems taken under consideration are $N = 15, 21, 27$ for the first type of simulation and
$N = 15, 21, 27, 33$ for the second.

In more detail, the three steps of the numerical procedure are the following.

1. We use Gaussian patterns in order to reduce finite size effects (as has been done for
the binary perceptron). From the replica method one expects that the results are equivalent
to those of binary patterns, in that they depend only on the first two moments of the
quenched variables.

2. The classification of the $2^n$ weight configurations is as follows: we start with
$\mathbf{J} = (-1, -1, \dots, -1)$. Next we compute for every $\ell$ and $\mu$ the field
$h_\ell^\mu = -\sum_i \xi_{\ell i}^\mu$ together with its sign ($\tau_\ell^\mu$), so that
the vector $\vec\tau_\ell$ labels the first subset. The subsequent $\mathbf{J}$
configurations are generated by means of the Gray code, which flips just one of the $J_i$
components at each time step and allows the field values to be updated with a single
operation, $h_\ell^\mu \to h_\ell^\mu + 2 J_{\ell i}\, \xi_{\ell i}^\mu$ with $J_{\ell i}$
the new value of the flipped component (this reduces the number of operations by a factor
$n$). Then, depending on whether the vector $\vec\tau_\ell$ is different from the previous
ones or not, we use $\vec\tau_\ell$ as the label of a new subset or increment the number of
vectors contained in the subset it already labels. We proceed in this way to scan the
$\mathbf{J}$ configurations. If $P$ varies from 1 to $3n$, every $\mathbf{J}$ configuration
is classified $n$ times on each subperceptron. At the end we obtain 3 ($P$ fixed) or $3n$
($P$ varying from 1 to $3n$) tables whose columns (at most $2^n$ in number) are the
$\vec\tau_\ell$ vectors labeling the subsets and to which are associated the numbers of
$\mathbf{J}_\ell$ belonging to each subset.

3. Finally, in the case of a given $P$, we take a column in each of the three tables and
verify whether the product between the two chosen columns from the first two tables is
equal to the column of the third one. If so, the internal representation given by the
three-column matrix is implementable and the volume of the corresponding domain is the
product of the numbers of $\mathbf{J}_\ell$ belonging to the three subsets.
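
The enumeration described above can be sketched in a few lines for small systems. This is only an illustrative sketch under stated assumptions, not the code used for the simulations: the function names `classify_subperceptron` and `domain_volumes` are hypothetical, a `Counter` keyed by the label vector replaces the explicit tables, all outputs are taken equal to $+1$, and instead of testing every triple of columns the third label is looked up directly as the product of the first two (which selects the same implementable representations).

```python
import itertools
from collections import Counter


def classify_subperceptron(patterns):
    """Group the 2^n weight vectors of one hidden unit by their label vector.

    patterns: list of P input vectors (each of length n) seen by this unit.
    Returns a Counter mapping tau (tuple of +1/-1, length P) to the number of
    weight vectors J producing that label.
    """
    n = len(patterns[0])
    J = [-1] * n
    fields = [-sum(xi) for xi in patterns]        # fields for J = (-1, ..., -1)
    table = Counter()
    table[tuple(1 if h > 0 else -1 for h in fields)] += 1
    for step in range(1, 2 ** n):
        # Gray code: the bit to flip is the lowest set bit of the step index.
        i = (step & -step).bit_length() - 1
        J[i] = -J[i]
        for mu, xi in enumerate(patterns):        # single update per field
            fields[mu] += 2 * J[i] * xi[i]
        table[tuple(1 if h > 0 else -1 for h in fields)] += 1
    return table


def domain_volumes(tables):
    """Volumes V_T of the implementable internal representations (K = 3)."""
    t1, t2, t3 = tables
    volumes = {}
    for (tau1, c1), (tau2, c2) in itertools.product(t1.items(), t2.items()):
        # With outputs +1, the third label must equal the product of the others.
        tau3 = tuple(a * b for a, b in zip(tau1, tau2))
        c3 = t3.get(tau3)
        if c3:
            volumes[(tau1, tau2, tau3)] = c1 * c2 * c3
    return volumes
```

The sum of the returned volumes is the Gardner volume of the sample, i.e. the number of weight configurations that classify all patterns correctly, which gives a simple consistency check against a brute-force count on very small systems.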

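Once the domain volumes ($V_T$) have been measured, we compute:

$$-r g(r) = \frac{1}{N}\ln \sum_T V_T^{\,r} , \qquad (41)$$

$$-w(r) = \frac{1}{N}\, \frac{\sum_T V_T^{\,r} \ln V_T}{\sum_T V_T^{\,r}} , \qquad (42)$$

(which is the domain size computed on the saddle point of the partition function) and

$$\mathcal{N}[w(r)] = -r g(r) + r w(r) . \qquad (43)$$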
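
For a single sample, the three measured quantities can be obtained from the list of domain volumes as in the sketch below. It assumes that the volumes are supplied as logarithms $\ln V_T$ and uses a log-sum-exp shift for numerical stability; the function name `volume_statistics` and its argument names are hypothetical, and the average over samples is left to the caller.

```python
import math


def volume_statistics(log_volumes, r, n_weights):
    """Return (r*g(r), w(r), entropy N[w(r)]) for one sample.

    log_volumes: list of ln V_T over the implementable internal representations.
    r:           exponent of the volumes in the partition function.
    n_weights:   total number N of binary weights.
    """
    scaled = [r * lv for lv in log_volumes]
    top = max(scaled)                              # shift to avoid overflow
    weights = [math.exp(s - top) for s in scaled]  # proportional to V_T^r
    norm = sum(weights)
    log_z = top + math.log(norm)                   # ln sum_T V_T^r
    rg = -log_z / n_weights
    mean_log_v = sum(wt * lv for wt, lv in zip(weights, log_volumes)) / norm
    w = -mean_log_v / n_weights
    entropy = -rg + r * w
    return rg, w, entropy
```

For instance, if a sample had $M$ domains all of the same volume $V$, the function would return the entropy $\ln(M)/N$ and the size $-\ln(V)/N$ for every $r$, as expected when the distribution of volumes is concentrated on a single value.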

For the first set of simulations, the above functions are computed just for $r = 0, 1$ and
the averages are taken over 10000 ($N = 15$), 1000 ($N = 21$) or 50 ($N = 27$) samples. In
the case of the second set of simulations, in order to allow for a comparison between all
the finite sizes considered, $\alpha$ is set to $\alpha = 0.33$; $r$ runs from $-1.5$ to 3
and the average is done over 10000 ($N = 15$, $N = 21$), 5000 ($N = 27$) or 200 ($N = 33$)
samples. The statistical error bars are within 0.1%.

As shown in Fig. 4, both theoretical and experimental results give
$g(r = 1) = -(1 - \alpha)\ln 2$, which coincides with the annealed approximation (so that
the total volume is simply reduced to a half for every added pattern and $\alpha_c = 1$).
At the value $\alpha = 0.277$ (Fig. 5), the total number of internal state vectors
belonging to the most numerous volumes (i.e. volumes characterized by $r = 0$) becomes
non-extensive ($w(r = 0) = 0$). Beyond such a value and in perfect agreement with
simulations, the correct solution is given by one step of RSB which, in fact, predicts
$w(r = 0) = 0$, $\forall \alpha > 0.277$.
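
The halving argument behind the annealed value can be spelled out as a short worked equation. It is a sketch assuming that each of the $P = \alpha N$ random patterns independently removes half of the $2^N$ binary weight configurations on average:

```latex
\begin{align}
\overline{V_G} &= 2^{N}\, 2^{-P} = 2^{N(1-\alpha)} , \\
g(r=1) &= -\frac{1}{N}\ln \overline{V_G} = -(1-\alpha)\ln 2 ,
\end{align}
```

so that the volume stops being exponentially large in $N$ exactly at $\alpha_c = 1$.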

As shown in Fig. 6, beyond $\alpha = 0.277$ the domains begin to disappear and the number
of internal representations ceases to be constant (equal to $2\alpha \ln 2$) and starts to
decrease with $\alpha$. For $r = 1$ the freezing transition takes place at $\alpha = 0.41$,
see Fig. 7 and Fig. 8.

As shown in Fig. 1, for $\alpha = 0.33$ the theoretical value for the freezing transition
is $r_c = 0.4$; for $r < r_c$ the slope of the curve $r g(r)$ is zero (it cannot become
positive) and $r g(r) = -\mathcal{N}[w(r_c)] = -0.43$. Finally, the plot of
$\mathcal{N}[w(r)]$ versus $w(r)$, for $\alpha = 0.33$, is given in Fig. 9.



V. CONCLUSION 

In this paper we have applied the internal representation volumes approach to the case of
binary multilayer networks, in particular to the non-overlapping parity machine. The chief
result of our study consists in a detailed comparison between the analytical predictions
and the numerical simulations, which confirms the validity of the method. The detailed
geometrical structure of the weight space predicted by the theory, both $\mathcal{N}_D$ and
$\mathcal{N}_R$ as well as the RSB transitions within the volumes, turns out to be in
remarkable agreement with the numerical simulations performed on finite systems.

As a general remark, let us emphasize that multilayer neural networks with binary weights
behave differently from their continuous counterparts. While the breaking of symmetry in
the former occurs inside the representation volumes, we have already shown that in the case
of real valued couplings the transition takes place between different volumes. Therefore,
the richness of the distribution of internal representations found in the continuous case,
i.e. the presence of a "finite" number of macroscopic regions in the weight space
containing a very large number of different internal representations, is partially lost
when one deals with discrete weights.

The method can be easily extended [2,11] to address the rule inference capability problem.
Another very interesting and important issue related to the present approach would be the
study of the distribution of metastable states arising from a gradient learning process.
Work is in progress along these lines.
FIG. 1. $r g(r)$ versus $r$ for $\alpha = 0.33$ and $K = 3$. The theoretical curve
corresponds to the continuous lines whereas the marked curves are the numerical results
obtained for $N = 15, 21, 27, 33$.

FIG. 2. Freezing transition for the binary parity machine for $K = 3$. The $r_c(\alpha)$
line separates the RS and the RSB phases. The three marked points describe the transition
at $\alpha = \mathrm{const.}$ and correspond to the following values of the parameters and
the entropy: (a) $q^*(r)$, $\hat q^*(r)$, $\mathcal{N}[w(r)]$, (b) $q^*(r_c)$,
$\hat q^*(r_c)$, $\mathcal{N}[w(r_c)]$ and (c) $\hat q_0^* = \hat q^*(r_c)/m^2$,
$q_0^* = q^*(r_c)$, $q_1^* = 1$, $\hat q_1^* = \infty$, $m = r/r_c$,
$\mathcal{N}[w(r)] = \mathcal{N}[w(r_c)]$.

FIG. 3. $\mathcal{N}[w(r)]/\alpha$ versus $w(r)$ for $\alpha = 0.177, 0.277, 0.41, 0.495$.
The dotted points signal the starting points ($r_c$) corresponding to $w(r_c) = 0$ and the
points with slope $r = 0$ and $r = 1$. Notice that the bold diamond point belongs to the
dashed-dotted curve.

FIG. 4. $g(r = 1)$ versus $\alpha$. The theoretical curve (continuous line) is compared
with the numerical outcomes (marked points).

FIG. 5. $-w(r = 0)$ versus $\alpha$ (theoretical continuous line and numerical points).
The $r = 0$ freezing transition appears at $\alpha = 0.277$.

FIG. 6. $\mathcal{N}_R/\alpha$ versus $\alpha$ (theoretical continuous line and numerical
points).

FIG. 7. $-w(r = 1)$ versus $\alpha$ (theoretical continuous line and numerical points).
The $r = 1$ freezing transition appears at $\alpha = 0.41$.

FIG. 8. $\mathcal{N}_D/\alpha$ versus $\alpha$ (theoretical continuous line and numerical
points).

FIG. 9. $\mathcal{N}[w(r)]/\alpha$ versus $w(r)$ for fixed $\alpha = 0.33$.